Automatic task distribution in scalable processors

ABSTRACT

The present invention relates to a processing method and apparatus for processing an information based on a sequence of instructions, wherein a repeated sub-sequence is detected in the sequence of instructions and an allocation between a processing resource and said repeated sub-sequence is determined based on an index information indicating the repetition frequency of the repeated sub-sequence. Thus, a combination of a scalable signal processor with automatic task distribution is provided, by means of which the number of memory accesses can be reduced, as the repeated sub-sequence can be allocated to external processing units, which are correspondingly programmed or which use their embedded memory. This also saves power.

The present invention relates to a method and apparatus for processing an information based on a sequence of instructions, especially to a method of scheduling processing resources in a scalable digital signal processor.

Digital signal processors (DSPs) are designed to execute tasks with hard real-time constraints. Therefore, emphasis is placed on computing power. Several factors determine the computing power. Probably the most obvious factor is clock frequency, but it is certainly not the only one. Other important factors are the expressive power of an operation (i.e. the instruction set), the number of clock cycles necessary to execute an operation, the number of operations that can be executed in parallel, the amount of pipelining between consecutive operations, the penalty in terms of wasted clock cycles when branches are executed, etc.

In DSP processors the basic concept to boost performance is exploiting instruction level parallelism. In the present application, an instruction refers to the complete piece of programme information which is supplied to a processor core in a single clock cycle. Thus, a single instruction can imply the execution of several operations in parallel in the same clock cycle. To allow the concurrent execution of operations in a processor, three architectural measures can be taken in the hardware. These are overlapped execution, functional unit pipelining, and multiple functional units. Overlapped execution means that the processor is working on multiple instructions simultaneously, that is, multiple instructions are active, each in a different phase of fetch, decode, and execute.

To exploit instruction level parallelism, the dependency between operations must be determined. Taking these dependencies into account, the operations must be scheduled at some particular time on some particular functional unit, and registers into which the results can be deposited must be assigned. Exploiting instruction level parallelism is a task of either the programmer, the compiler, or the runtime hardware. Depending on the architectural approach which is taken, emphasis is on one of the three approaches.

In traditional DSPs, instructions are executed in sequences with a capability of jumping from one position in the sequence to another, depending on the current state. This capability leads to a situation where a given application is composed of instructions executed once and instructions or short sequences of them executed more than once. If an application is monitored during its execution on a given DSP, a profile can be generated, where different parts of the application are given a different repetition index r_(i).

FIG. 1 shows a schematic diagram in which a sequence of instructions is indicated by a sequence of horizontal bars, each bar corresponding to a single instruction. Furthermore, those instructions included within a dotted frame form repeated sub-sequences of instructions. The repetition index r_(i) indicated at each repeated sub-sequence represents the repetition rate, wherein the index r_(i)=0 corresponds to the most frequently executed sub-sequence and no index means that the instruction is executed only once. Thus, in the sequence of instructions shown in FIG. 1, the repeated sub-sequence indicated in the middle part is the sub-sequence which is most frequently executed, the upper repeated sub-sequence is the second most frequently executed sub-sequence, and the lower repeated sub-sequence is the third most frequently executed sub-sequence.
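
By way of illustration only, the ranking of repeated sub-sequences into repetition indices can be sketched as follows. The sketch is written in Python, which is not part of the original disclosure. The function name repetition_indices and the execution counts are hypothetical and merely mirror the three repeated sub-sequences of FIG. 1.

```python
def repetition_indices(counts):
    """Map each repeated sub-sequence to a repetition index r_i.

    r_i = 0 is given to the most frequently executed sub-sequence.
    Parts executed only once receive no index at all.
    """
    repeated = [(name, n) for name, n in counts.items() if n > 1]
    ranked = sorted(repeated, key=lambda item: item[1], reverse=True)
    return {name: r_i for r_i, (name, _) in enumerate(ranked)}


# Hypothetical execution profile (sub-sequence name -> execution count).
profile = {"upper": 40, "middle": 100, "lower": 12, "init": 1}
indices = repetition_indices(profile)
```

With this assumed profile, the middle sub-sequence receives index 0, the upper one index 1 and the lower one index 2, while the part executed once receives no index.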

However, if such a sequence of instructions is executed by a single processor core of a DSP, the instruction memory is accessed several times to fetch the same sub-sequence, which delays processing and reduces performance.

Document U.S. Pat. No. 5,796,989 discloses a system for optimising the execution of instructions undertaken by a processor. In particular, instruction code sequences are reorganised, so that the native code used to emulate instructions which occur most frequently is grouped together. As a result, when the native code for a frequently occurring emulated instruction is loaded from the memory into the cache, the likelihood that the cache will contain the native code for subsequent emulated instructions is maximised. Thus, the most frequently received instructions are stored in an easily accessible manner, reducing processing latency.

It is an object of the present invention to provide a method and apparatus for processing an information based on a sequence of instructions, by means of which the processing efficiency can be further improved.

This object is achieved by a method as claimed in claim 1, an apparatus as claimed in claim 12, and a compiler as claimed in claim 18.

Accordingly, due to the fact that the repetition rate of the repeated sub-sequences is identified, resource-consuming sequences with a high repetition rate can be deferred to external processing units, while the remaining sequences with a low repetition rate are executed in the conventional manner by the core processor. Thereby, the performance can be improved by adding external processing resources. Furthermore, a flexible processing system can be provided, due to the fact that the kind and number of external processing units may individually be selected. As the most repetitive sub-sequences are outsourced to the external processing units, accesses to the instruction memory are reduced, which in turn reduces system power requirements.

If a signalling function is provided, by means of which external processing units may signal their availability to the processor core, a plug'n'play operating mode can be established, where the external processing units may be selectively added and automatically considered in the task distribution. Due to the execution overlap between the different external resources and the self-expandable processor itself (core processor), an increased performance can be provided, while a minimum performance, corresponding to the performance of the core processor as such, is always guaranteed. Due to the flexibility of the arrangement, compatibility with current and foreseeable DSP generations can be assured.

Advantageous further developments of the present invention are defined in the dependent claims.

Preferably, an instruction containing the index information may be generated and added to the sequence of instructions. The index information may comprise an integer number set in proportion with a ranking of the repetition rate of the repeated sub-sequence compared to the repetition rate of other detected repeated sub-sequences. In particular, the allocation may be determined by comparing the integer number with the number of available processing resources. Then, all repeated sub-sequences for which the integer number is smaller than the number of available processing resources are allocated to a selected processing resource.

Furthermore, the index information may comprise an information indicating the number of instructions in the repeated sub-sequence.

If the repeated sub-sequence is no longer detected for a predetermined time period, an instruction is generated for deleting the repeated sub-sequence, and a processing unit to which the deleted repeated sub-sequence was allocated is reset.

Additionally, an instruction may be generated for specifying processing registers used by the repeated sub-sequence, wherein the instruction is used for locking the specified processing registers.

A processing resource may be activated when the instruction containing the index information indicates that the corresponding repeated sub-sequence has already been allocated to the processing resource. In this case, the activating step may comprise the step of programming the processing resource according to the corresponding repeated sub-sequence, or uploading the corresponding repeated sub-sequence to a memory of the processing resource.

The presence of external processing units may be signalled to a central processing unit, and the number of available external processing units may be counted based on the signalling.

In the processing apparatus, connecting means may be provided for connecting at least one external processing unit to which the repeated sub-sequence can be allocated. Furthermore, a memory table may be provided for storing an allocation information indicating allocation between the at least one external processing unit and corresponding repeated sub-sequences. The external processing units may comprise processing cores and/or configurable logic blocks.

Additionally, mapping means may be provided for mapping the repeated sub-sequence to an available one of the at least one external processing unit based on the index information.

The compiler may be arranged to add to the repeated sub-sequence an instruction specifying the index information. The additional instruction may be added so as to precede the repeated sub-sequence.

Furthermore, the compiler may be arranged to add the instruction indicating that the repeated sub-sequence is not used anymore and/or the instruction for specifying the processing registers used by the repeated sub-sequence. Furthermore, the compiler may be arranged to determine the ranking of the repeated sub-sequences based on their repetition rate.

In the following, the present invention will be described in greater detail on the basis of a preferred embodiment with reference to the accompanying drawings in which:

FIG. 1 shows a schematic diagram of a sequence of instructions comprising repeated sub-sequences;

FIG. 2 shows a simplified diagram of a self-expandable digital signal processor, in which the present invention can be implemented;

FIG. 3 shows a simplified block diagram indicating a mapping of repeated sub-sequences into available external processing units, according to the preferred embodiment of the present invention;

FIG. 4 shows a schematic diagram of a processing apparatus according to the preferred embodiment; and

FIG. 5 shows a flow diagram of a processing method according to the preferred embodiment.

The preferred embodiment will now be described on the basis of a self-expandable DSP (Se-DSP) as indicated in FIG. 2.

According to FIG. 2, the Se-DSP 10 comprises a plurality of ports to which processing resources 20 to 23 can be connected. In this context, a resource is either a re-configurable core or a small processing core with embedded memory. Due to this configuration, the Se-DSP 10 can run an application in its optimal form depending on the processing resources 20 to 23 available at the time, up to a maximum limit of resources. In the case of FIG. 2, up to four additional processing resources or units can be connected or attached, which may be configurable logic blocks, e.g. Field Programmable Gate Arrays (FPGAs), or processor cores provided with their own memory.

FIG. 3 shows a simplified diagram of a mapping function for mapping the repeated sequences of FIG. 1 to the available processing resources or units (Co-units) in the Se-DSP 10. According to the preferred embodiment, a mechanism is provided by which the Se-DSP 10 can map the repeated or repetitive sequences with higher repetition rate on the available additional processing units 20 to 23. Thereby, as soon as such a repeated sub-sequence is detected in the Se-DSP 10, the corresponding processing is handed over or allocated to a predetermined one of the processing units 20 to 23, such that the Se-DSP 10 may continue processing based on the subsequent instruction following the allocated repeated sub-sequence. Thereby, repeated sub-sequences can be processed concurrently, while the Se-DSP 10 continues processing of the remaining less repetitive instructions or sub-sequences.

FIG. 4 shows a schematic block diagram of the processing in the Se-DSP 10, wherein an original programme code is compiled in a compiler 30. The compiler is arranged to identify repeated sequences of instructions, i.e. loops or function calls, and to determine at compiling time a ranking of these sequences or sub-sequences based on their repetition rate. The signalling of the repeated sub-sequences may be based on at least one additional instruction by means of which an information about the repeated sub-sequences can be notified to the Se-DSP 10 at execution time.

Furthermore, a mechanism is provided, by which the Se-DSP 10 can determine how many additional external processing units 20 to 2n are attached. Furthermore, another mechanism is provided by which the Se-DSP 10 can map the detected repeated sub-sequences with higher repetition rate on the available processing units 20 to 2n.

To achieve this, the Se-DSP 10 is provided with an additional instruction, e.g. called rep_index, and an internal memory or table 40. The additional instruction is used by the compiler 30 to delimit a repetitive sequence of instructions. It provides the number of instructions in the sequence and the repetition index r_(i). According to the preferred embodiment, the repetition index r_(i) is a number greater than or equal to zero, with zero being the index for the sub-sequence with the highest repetition rate. Therefore, if a number n_(r) of additional resources is attached, all sub-sequences with repetition index r_(i) less than n_(r) can be mapped to the additional resources, e.g. the processing units 20 to 2n.
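
The comparison between the repetition index r_(i) and the number n_(r) of attached resources may be illustrated by the following minimal sketch in Python. The record RepIndex and the function can_be_mapped are hypothetical names. The source only states that the rep_index instruction carries the number of instructions and the repetition index, not how these are encoded.

```python
from typing import NamedTuple


class RepIndex(NamedTuple):
    """Assumed payload of a rep_index instruction (encoding not specified)."""
    length: int  # number of instructions in the delimited sub-sequence
    r_i: int     # repetition index, 0 = highest repetition rate


def can_be_mapped(instr: RepIndex, n_r: int) -> bool:
    """A sub-sequence qualifies for an additional resource when r_i < n_r."""
    return 0 <= instr.r_i < n_r


# Three delimited sub-sequences, two attached resources (assumed values).
markers = [RepIndex(8, 0), RepIndex(5, 1), RepIndex(12, 2)]
mapped = [m.r_i for m in markers if can_be_mapped(m, n_r=2)]
```

With two attached resources, only the sub-sequences with indices 0 and 1 qualify. The one with index 2 remains with the core.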

The internal memory or table 40 is used to store information about the repeated sub-sequences. The table may have one entry per possible processing unit. Every time a repeated sub-sequence is mapped onto an additional processing unit, a corresponding entry is set, i.e. the port name of the respective processing unit is written into the table 40. The table 40 is indexed by the repetition index r_(i).
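
A possible software model of the table 40 is sketched below in Python, purely for illustration. The class name AllocationTable, its method names and the port name "port_20" are assumptions. The source only specifies one entry per possible processing unit, indexing by r_(i), and the port name as entry content.

```python
class AllocationTable:
    """Model of table 40: one slot per possible unit, indexed by r_i."""

    def __init__(self, max_units):
        # None marks a repetition index that is not mapped to any unit.
        self.ports = [None] * max_units

    def set_entry(self, r_i, port_name):
        """Record that the sub-sequence with index r_i runs on port_name."""
        self.ports[r_i] = port_name

    def lookup(self, r_i):
        """Return the port for r_i, or None if unmapped or out of range."""
        if 0 <= r_i < len(self.ports):
            return self.ports[r_i]
        return None


table = AllocationTable(max_units=4)
table.set_entry(0, "port_20")  # hypothetical port name
```

Indexing by r_(i) means that a lookup on a later rep_index instruction requires no search.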

Accordingly, a generic DSP is allowed to run an application in its optimal form depending on the available processing resources. The generic DSP architecture is extended by a number of ports to which processing resources can be connected, wherein a processing resource corresponds either to a configurable core or a small processing core with embedded memory, or a programmable logic unit. Furthermore, the internal table 40 and at least one additional instruction are required to specify a sequence of instructions executed more than once and its repetition index. From the software side, the compiler is arranged to generate the repetition indexes r_(i). The generation of the repetition index r_(i) may be based on the generation of similar statistics as performed in modern VLIW (Very Long Instruction Word) compilers.

Thus, according to the preferred embodiment, repeated sequences of instructions with higher repetition rate are deferred to the available connected processing resources and the remaining instruction code is executed in a conventional manner by the DSP.

According to the preferred embodiment, two additional instructions may be added. A first discard instruction may be used to inform the Se-DSP 10 to delete a repeated sub-sequence with a repetition index r_(i) specified by the instruction itself. In this case, the processing unit on which this repeated sub-sequence was mapped is reset, which means that it is put into its initial or reset state. Then, a repeated sub-sequence with lower repetition rate, i.e. higher repetition index r_(i), can be mapped on this reset processing unit.
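
The effect of the discard instruction may be modelled as in the following Python sketch. The dictionary-based table, the state strings and the function name discard are hypothetical simplifications and not the described hardware.

```python
def discard(table, unit_state, r_i):
    """Delete the mapping for r_i and put the freed unit into reset state.

    table maps r_i -> port name. unit_state maps port name -> state.
    Returns the freed port, or None if r_i was not mapped.
    """
    port = table.pop(r_i, None)
    if port is not None:
        unit_state[port] = "reset"
    return port


# Assumed situation: two sub-sequences mapped on two busy units.
table = {0: "port_20", 1: "port_21"}
unit_state = {"port_20": "busy", "port_21": "busy"}
freed = discard(table, unit_state, 1)
```

After the call, the unit on the freed port is in its reset state and can accept a sub-sequence with a higher repetition index.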

Furthermore, a second mask instruction can be used to specify internal registers of the Se-DSP 10, which will be used by the respective repeated sub-sequence. This mask instruction follows the repetition index instruction. By this mask instruction, the specified registers are effectively locked, i.e. their use is not allowed until the corresponding repeated sub-sequence has been completed. In this way, the execution of a repeated sub-sequence will not stall the Se-DSP 10 until the moment when the application tries to access one of the specified registers. Thus, the register locking provides the advantage that the execution of the repeated sub-sequences does not stall or lock the remaining processing resources. Thereby, concurrent execution of repeated and non-repeated instructions is possible. Once a repeated sub-sequence is mapped and stored in the internal table 40 of the Se-DSP 10, it will not be fetched again by the Se-DSP 10. In fact, when the repetition index instruction for a mapped sub-sequence which is already stored or registered in the internal table 40 is detected, a branch operation is initiated and the mapped processing unit is activated. Hence, access to the instruction memory (not shown in FIG. 4) is reduced.
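
The register locking may be pictured with a bit mask, as in the following Python sketch. Representing the mask as one bit per register is an assumption, since the source does not specify the format of the mask instruction. The function names and register numbers are hypothetical.

```python
def register_mask(registers):
    """Build a bit mask with one bit set per locked register number."""
    mask = 0
    for reg in registers:
        mask |= 1 << reg
    return mask


def must_stall(locked_mask, reg):
    """The core stalls only when it touches a register that is locked."""
    return bool((locked_mask >> reg) & 1)


# Assumed: the offloaded sub-sequence uses registers 2 and 5.
locked = register_mask([2, 5])
# When the sub-sequence completes, its registers are released again.
released = locked & ~register_mask([2, 5])
```

An access to any other register, e.g. register 3, proceeds without stalling, which is what permits the concurrent execution described above.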

The number of available additional processing units 20 to 2n is determined by the Se-DSP 10, e.g. right after a reset operation. For example, each external processing unit 20 to 2n may signal its presence by means of a simple signal, e.g. a 1-bit signal, and the Se-DSP 10 may simply count the signals received from the additional processing units 20 to 2n. If no additional unit is present, the Se-DSP 10 behaves as a conventional DSP.
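
Counting the presence signals amounts to a simple sum, as the following Python sketch shows. The list of bits and the function name count_units are hypothetical, with one bit assumed per port.

```python
def count_units(presence_bits):
    """Count the ports whose 1-bit presence signal is asserted."""
    return sum(1 for bit in presence_bits if bit)


# Assumed state after reset: units on ports 0, 1 and 3 are attached.
n_r = count_units([1, 1, 0, 1])
```

A result of zero corresponds to the case in which the Se-DSP 10 behaves as a conventional DSP.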

The mapping of the repeated sub-sequences to the processing units 20 to 2n depends on the nature or kind of the respective processing unit. In case of an FPGA, mapping may be performed by correspondingly programming the FPGA. In case of processing cores provided with a memory, mapping may be performed by uploading the repeated sub-sequence to the memory of the processing core. The processing units 20 to 2n have access to the register file of the Se-DSP 10 in any conventional manner. They may be arranged on the same integrated circuit as the Se-DSP 10 or may be provided on an external circuit.

FIG. 5 shows a schematic flow diagram indicating a processing operation of the Se-DSP 10.

In a first step S100, the core of the Se-DSP 10 detects whether any external resource, e.g. additional processing unit 20 to 2n, is available. This may be achieved by counting the corresponding notification signals received from the external processing resources. If resources are available, the number n_(r) of the external processing units 20 to 2n is stored in the internal table 40 in step S101, and the external processing units 20 to 2n are put into their reset state in step S102. Then, the application is started and the first instruction is read from the instruction memory in step S103. If no external processing resource is detected in step S100, the Se-DSP 10 behaves as a conventional DSP without any mapping function for repeated sub-sequences.

In step S104, it is checked whether the read instruction indicates an assigned sequence already stored in the internal table 40. If so, the corresponding processing resource, e.g. processing unit, is activated in step S108 and the procedure returns to step S103 to read the subsequent instruction. If the read instruction does not indicate an assigned sequence in step S104, it is checked whether the read instruction indicates a repetitive or repeated sub-sequence in step S105. If so, it is checked in step S106 whether the repetition index r_(i) indicated by the corresponding repetition index instruction is smaller than the number n_(r) of the available processing resources 20 to 2n. If so, the respective repeated sub-sequence is assigned to an available processing resource in step S107 and the selected processing resource is activated in step S108. Furthermore, a corresponding entry is added to the internal table 40 specifying the selected processing resource. The assignment in step S107 may be effected by storing the repeated sub-sequence in the internal memory of the processing resource or using the repeated sub-sequence to configure the processing resource. If the read instruction does not indicate any repeated sub-sequence or the repetition index r_(i) is not smaller than the number of available processing resources, the procedure proceeds to step S109, where the read instruction is executed in the Se-DSP 10 in a conventional manner. Then the flow returns to step S103 in order to read the subsequent instruction. After handing over a repeated or assigned sub-sequence, the core of the Se-DSP 10 continues executing the subsequent instructions until the result or results of such a sub-sequence are required for a subsequent instruction.
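
The flow of steps S103 to S109 may be sketched as the following Python simulation, offered as an illustration only. The tuple encoding of instructions, the function name run and the rule of choosing the lowest free port are assumptions. Register locking and the actual concurrent execution on the units are not modelled.

```python
def run(program, n_r):
    """Walk a flattened instruction trace and log where each part runs.

    program: list of ("op", name) or ("rep_index", r_i, length) tuples,
    where a rep_index marker precedes its `length` body instructions.
    n_r: number of attached processing units (0 = conventional DSP).
    """
    table = {}  # table 40: r_i -> port number
    log = []
    pc = 0
    while pc < len(program):
        instr = program[pc]                                 # S103
        if instr[0] == "rep_index":                         # S105
            _, r_i, length = instr
            if r_i in table:                                # S104
                log.append(f"activate unit {table[r_i]}")   # S108
                pc += 1 + length                            # skip body
                continue
            if r_i < n_r:                                   # S106
                used = set(table.values())
                port = next(p for p in range(n_r) if p not in used)
                table[r_i] = port                           # S107
                log.append(f"assign r{r_i} -> unit {port}")
                log.append(f"activate unit {port}")         # S108
                pc += 1 + length
                continue
            pc += 1  # not mappable: the body runs on the core below
            continue
        log.append(f"core executes {instr[1]}")             # S109
        pc += 1
    return table, log


trace = [
    ("op", "a"),
    ("rep_index", 0, 2), ("op", "x"), ("op", "y"),
    ("op", "b"),
    ("rep_index", 1, 1), ("op", "z"),
    ("rep_index", 0, 2), ("op", "x"), ("op", "y"),
]
table, log = run(trace, n_r=1)
```

With one attached unit, the sub-sequence with index 0 is assigned on first encounter and merely re-activated on the second, so its body is not fetched again, whereas the sub-sequence with index 1 stays on the core.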

Thus, sub-sequences or tasks can be flexibly assigned to external processing units, e.g. co-processors available to the Se-DSP 10. Thereby, accesses to the instruction memory can be reduced, as the external processing units use their embedded memory or are correspondingly programmed. This also saves power.

It is to be noted that the present invention is not restricted to the preferred embodiment described above, but can be used in any scalable data processing architecture in which an information is processed based on a sequence of instructions. The preferred embodiment may thus vary within the scope of the attached claims.

1. A method for processing an information based on a sequence of instructions, said method comprising the steps of: a) detecting a repeated sub-sequence in said sequence of instructions; b) providing an index information indicating the repetition frequency of said repeated sub-sequence; and c) determining an allocation between a processing resource and said repeated sub-sequence based on said index information.
2. A method according to claim 1, further comprising the step of generating an instruction containing said index information, and adding said instruction to said sequence of instructions.
3. A method according to claim 1, wherein said index information comprises an integer number set in proportion with a ranking of said repetition rate of said repeated sub-sequence compared to the repetition rate of other detected repeated sub-sequences.
4. A method according to claim 3, wherein said allocation is determined by comparing said integer number with the number of available processing resources (20-2n).
5. A method according to claim 4, wherein all repeated sub-sequences for which said integer number is smaller than said number of available processing resources are allocated to a selected processing resource.
6. A method according to claim 1, wherein said index information comprises an information indicating the number of instructions in said repeated sub-sequence.
7. A method according to claim 1, further comprising the step of generating an instruction for deleting said repeated sub-sequence, if said repeated sub-sequence is no longer detected for a predetermined time period, and resetting a processing unit to which said deleted repeated sub-sequence was allocated.
8. A method according to claim 1, further comprising the step of generating an instruction for specifying processing registers used by said repeated sub-sequence, and using said instruction for locking said specified processing registers.
9. A method according to claim 2, further comprising the step of activating a processing resource (20-2n) when said instruction containing said index information indicates that the corresponding repeated sub-sequence has already been allocated to said processing resource.
10. A method according to claim 9, wherein said activating step comprises the step of programming said processing resource according to said corresponding repeated sub-sequence, or uploading said corresponding repeated sub-sequence to a memory of said processing resource.
11. A method according to claim 1, further comprising the step of signalling the presence of external processing units (20-2n) to a central processing unit (10), and counting the number of available external processing units based on said signalling.
12. An apparatus for processing an information based on a sequence of instructions, said apparatus comprising: a) detecting means (30) for detecting a repeated sub-sequence in said sequence of instructions, and for providing an index information indicating the repetition frequency of said repeated sub-sequence; and b) resource control means (10) for allocating said repeated sub-sequence to a processing resource based on said index information.
13. An apparatus according to claim 12, further comprising connecting means for connecting at least one external processing unit (20-2n) to which said repeated sub-sequence can be allocated.
14. An apparatus according to claim 13, further comprising a memory table (40) for storing an allocation information indicating an allocation between said at least one external processing unit (20-2n) and corresponding repeated sub-sequences.
15. An apparatus according to claim 13, wherein said apparatus is a digital signal processor (10) and said at least one external processing unit (20-2n) comprises a processor core and/or a configurable logic block.
16. An apparatus according to claim 13, further comprising means for determining the number of said at least one external processing unit (20-2n) connected to said connecting means.
17. An apparatus according to claim 13, further comprising mapping means for mapping said repeated sub-sequence to an available one of said at least one external processing unit (20-2n) based on said index information.
18. A compiler for providing an output sequence of instructions to be used for processing an information, said compiler being arranged to detect a repeated sub-sequence in said output sequence of instructions and to provide an index information indicating the repetition frequency of said repeated sub-sequence.
19. A compiler according to claim 18, wherein said compiler (30) is arranged to add to said repeated sub-sequence an instruction specifying said index information.
20. A compiler according to claim 19, wherein said additional instruction is added so as to precede said repeated sub-sequence.
21. A compiler according to claim 18, wherein said compiler (30) is arranged to add to said output sequence an instruction for indicating that said repeated sub-sequence is not used anymore.
22. A compiler according to claim 18, wherein said compiler (30) is arranged to add to said output sequence an instruction for allocating at least one processing register means until said repeated sub-sequence is finished.
23. A compiler according to claim 18, wherein said compiler (30) is arranged to determine a ranking of repeated sub-sequences based on their repetition rate.